FOREWORD


Density estimation plays a central role in probabilistic pattern recognition and signal
processing. As data sets get larger, the cost of identifying a definitive class with each observation
can become prohibitive. Instead, it becomes important to develop methods of processing the data
that make use of all available information.

This work was done under the joint support of the Naval Surface Warfare Center, Dahlgren
Division, Dahlgren, Virginia, Independent Research Program and the Office of Naval Research
(R&T No. 4424314).
INTRODUCTION

Finite mixture models have proven to be quite flexible as parametric probability density
function estimators [1,2]. Recently an adaptive mixture model was presented whose complexity, or
number of terms, is determined in a data-driven manner [3]. This approach has made possible the use
of mixture models within a semiparametric setting, giving them much more general applicability
and utility than was possible under rigid parametric assumptions.

This semiparametric use of mixture models has resulted in efforts to develop alternative
adaptive mixture model algorithms [4,5]. Recent applications of semiparametric mixture model
density estimation can be found in References 6, 7, 8, 9, 10, and 11. Thus, in addition to the traditional
parametric uses of mixture models, the semiparametric application of mixture models is now
well established.

One of the problems that arises in many large-scale applications of mixture models to
density estimation is that, as the size of the data set increases, the class-labeled data becomes a
smaller subset of the total data set. That is, while many small data sets may have all the observations
labeled as to class membership, large data sets often consist of labeled subsets plus a potentially
large unlabeled subset.

The reason for this can be illustrated with an image processing example. Suppose that
features are to be computed for each pixel of a number of images and that densities are to be
computed for each class. Depending on the problem, the classes may correspond to vehicles,
buildings, woods, and open terrain, or to tumorous and nontumorous tissue. If all the available
data is to be used, the work in allocating each original pixel to one of the classes can easily
become prohibitive. The more usual case is that only a representative subset of the training data is
class labeled, with the balance either uncategorized or partially categorized. An example of the
latter case is that it may be easy to say that there are no vehicles in this image, no buildings in that
one, and so on, but difficult or time consuming to identify each pixel corresponding to each class
in each image. It is often the case in medical imagery that ground truth cannot be established
definitively without a biopsy, again leading to less than full categorization of the observations.

Thus it is desirable to have a unified framework for handling this combined supervised
(class labeled data)/unsupervised (unlabeled data) problem. This is the motivation behind the
following development of joint representation mixture models.

This report begins with the formulation of traditional finite mixture models and proceeds
to develop the joint representation mixture model equations.
JOINT REPRESENTATION MIXTURE MODELS

FINITE  MIXTURE  MODELS 


Given a probability density function that can be represented as a finite ($g$-term) mixture
model

$$p(x \mid \psi) = \sum_{i=1}^{g} \pi_i f(x; \theta_i), \tag{1}$$

where $f(\cdot\,; \theta)$ denotes a generic member of the chosen parametric family, the likelihood function
for a set of $n$ observations drawn from the particular density $p(x \mid \psi)$ can be written as

$$L(\psi) = \prod_{j=1}^{n} p(x_j \mid \psi) = \prod_{j=1}^{n} \sum_{i=1}^{g} \pi_i f(x_j; \theta_i). \tag{2}$$

The vector $\theta_i$ represents the parameter set for the $i$th mixture component, while $\psi$ represents the
combined total parameter set including the mixing coefficients $\pi_i$. The log-likelihood function is

$$\ln L(\psi) = \sum_{j=1}^{n} \ln \left[ \sum_{i=1}^{g} \pi_i f(x_j; \theta_i) \right]. \tag{3}$$

The maximum likelihood update equations can be obtained by taking derivatives of the
log-likelihood function with respect to the mixture model parameters, setting the resulting expressions
equal to zero, and solving for the parameters.
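As an illustration of Equations (1) and (3), the following minimal sketch evaluates a mixture density and its log-likelihood. It assumes a univariate normal family for $f$ (the choice the report adopts later for its derivations); the function names and all numerical values are hypothetical and chosen only for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Equation (1): sum over terms of pi_i * f(x; theta_i)."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def log_likelihood(data, weights, means, variances):
    """Equation (3): sum over observations of the log mixture density."""
    return sum(math.log(mixture_pdf(x, weights, means, variances)) for x in data)


# Hypothetical two-term mixture and a handful of made-up observations.
weights = [0.3, 0.7]
means = [-1.0, 2.0]
variances = [1.0, 0.5]
data = [-1.2, 0.1, 1.9, 2.4]
print(log_likelihood(data, weights, means, variances))
```

The log-likelihood is the quantity that the update equations derived later are meant to increase from one iteration to the next.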


JOINT  REPRESENTATION  MIXTURE  MODELS 

The Joint Representation Mixture Model is defined by

$$p(x \mid \psi) = \sum_{i=1}^{g} p(\text{term } i)\, p(x \mid \text{term } i) \sum_{m=1}^{M} p(\text{class } m \mid \text{term } i) = \sum_{i=1}^{g} \sum_{m=1}^{M} \zeta_{im} \pi_i f(x; \theta_i), \tag{4}$$

where $\zeta_{im} = p(\text{class } m \mid \text{term } i)$ is an intra-term class mixing coefficient that gives the relative
proportion of the $i$th term associated with the $m$th class with the constraint

$$\sum_{m=1}^{M} \zeta_{im} = \sum_{m=1}^{M} p(\text{class } m \mid \text{term } i) = 1. \tag{5}$$

This constraint merely says that for each term independently, the class mixing coefficients must
sum to one, or equivalently, that an observation from term $i$ must belong to one of the $M$ classes
with probability one.
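To make the roles of $\zeta_{im}$ and the constraint concrete, the sketch below builds the $g \times M$ table of joint contributions $\zeta_{im} \pi_i f(x; \theta_i)$ and confirms numerically that summing it reproduces the ordinary mixture density when each row of $\zeta$ sums to one. It assumes univariate normal terms; the names `joint_terms` and `joint_pdf` and all numbers are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def joint_terms(x, weights, means, variances, zeta):
    """Return the g-by-M table of zeta[i][m] * pi_i * f(x; theta_i)."""
    table = []
    for w, mu, var, zeta_row in zip(weights, means, variances, zeta):
        base = w * normal_pdf(x, mu, var)
        table.append([z * base for z in zeta_row])
    return table


def joint_pdf(x, weights, means, variances, zeta):
    """Equation (4): sum of the joint table over terms and classes."""
    return sum(sum(row) for row in joint_terms(x, weights, means, variances, zeta))


# Hypothetical model: g = 2 terms, M = 2 classes; each zeta row sums to one.
weights = [0.4, 0.6]
means = [0.0, 3.0]
variances = [1.0, 2.0]
zeta = [[0.9, 0.1],
        [0.2, 0.8]]

plain = sum(w * normal_pdf(1.0, m, v) for w, m, v in zip(weights, means, variances))
print(joint_pdf(1.0, weights, means, variances, zeta), plain)
```

Because of Equation (5), the class structure adds no probability mass of its own: it only apportions each term among the classes.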

The mixture model defined in Equations (4) and (5) represents a significant departure from
traditional mixture model usage. Historically, a single mixture model has been used either to
perform unsupervised clustering or to generate a probability density function for observations
from a single class. If observations from multiple classes are to be dealt with, then a separate
mixture model is developed for data from each class. This latter approach leaves open the question of
how to incorporate partially class categorized or uncategorized observations when there are
separate mixture models for each class. As will be seen, this new formulation leads to a unified
treatment of these cases.


JOINT  REPRESENTATION  MIXTURE  MODEL  LIKELIHOOD  FUNCTIONS 
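The likelihood function for class categorized data is

$$\begin{aligned}
L_c(\psi) &= \prod_{j=1}^{n_c} \prod_{m=1}^{M} \left\{ \left[ p(x_j \mid \psi \cap \text{class } m) \right]^{z_{jm}} \right\} \\
&= \prod_{j=1}^{n_c} \prod_{m=1}^{M} \left[ \sum_{i=1}^{g} \zeta_{im} \pi_i f(x_j; \theta_i) \right]^{z_{jm}}.
\end{aligned} \tag{6}$$

Here, $z_{jm}$ is a binary valued class indicator function. For observation $j$ from class $h$, $z_{jh} = 1$ and
$z_{jm} = 0$ for $m \neq h$. It thus can be considered as a picking function. It is used to “pick out” the desired
contribution to the likelihood function. For each term in the product where it has the value zero,
the contribution to the product is one so that the likelihood is unaffected. The log-likelihood
function for class categorized data can be written

$$\begin{aligned}
\ln L_c(\psi) &= \sum_{j=1}^{n_c} \sum_{m=1}^{M} \ln \left[ \sum_{i=1}^{g} \zeta_{im} \pi_i f(x_j; \theta_i) \right]^{z_{jm}} \\
&= \sum_{j=1}^{n_c} \sum_{m=1}^{M} z_{jm} \ln \left\{ \sum_{i=1}^{g} \zeta_{im} \pi_i f(x_j; \theta_i) \right\} \\
&= \sum_{j=1}^{n_c} \ln \sum_{i=1}^{g} \sum_{m=1}^{M} z_{jm} \zeta_{im} \left[ \pi_i f(x_j; \theta_i) \right].
\end{aligned} \tag{7}$$

The last equality holds because, for each $j$, exactly one of the $z_{jm}$ equals one, so the sum over $m$
may be moved inside the logarithm.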
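The picking-function form of Equation (7) can be sketched as follows, assuming univariate normal terms and class labels stored as integer indices (so that $z_{jm} = 1$ exactly when $m$ equals the stored label). The function name and the sample values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def categorized_log_likelihood(data, labels, weights, means, variances, zeta):
    """Equation (7): sum_j ln sum_i zeta[i][m_j] * pi_i * f(x_j; theta_i)."""
    total = 0.0
    for x, m in zip(data, labels):
        # z_jm picks out the single class m recorded for observation j.
        class_part = sum(
            zeta[i][m] * weights[i] * normal_pdf(x, means[i], variances[i])
            for i in range(len(weights))
        )
        total += math.log(class_part)
    return total


# Hypothetical model with g = 2 terms and M = 2 classes.
weights = [0.4, 0.6]
means = [0.0, 3.0]
variances = [1.0, 2.0]
zeta = [[0.9, 0.1],
        [0.2, 0.8]]
data = [-0.3, 0.5, 2.8, 3.6]
labels = [0, 0, 1, 1]
print(categorized_log_likelihood(data, labels, weights, means, variances, zeta))
```

With a single class ($M = 1$, every $\zeta_{i1} = 1$) this reduces to the ordinary mixture log-likelihood of Equation (3).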
To derive the likelihood for partially (class) categorized data, first consider the likelihood
appropriate for the case where the data is both class and term categorized. In this case the
likelihood is

$$L_{ct}(\psi) = \prod_{j=1}^{n_c} \prod_{m=1}^{M} \prod_{i=1}^{g} \left[ \zeta_{im} \pi_i f(x_j; \theta_i) \right]^{z_{jm} z_{ji}}, \tag{8}$$

where, as before, $z_{jm}$ ($z_{ji}$) is a binary valued class (term) indicator function. For observation $j$
from class $h$, $z_{jh} = 1$ and $z_{jm} = 0$ for $m \neq h$, while for observation $j$ from term $k$, $z_{jk} = 1$ and
$z_{ji} = 0$ for $i \neq k$.


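In the absence of complete knowledge of $z_{jm}$ and/or $z_{ji}$, the usual procedure is to use
expected values for either or both as

$$E[z_{jm}] = \xi_{jm} \tag{9}$$

and

$$E[z_{ji}] = \tau^p_{ij}. \tag{10}$$

Then for partially (class) categorized data, the likelihood is

$$L_p(\psi) = \prod_{j=1}^{n_p} \prod_{i=1}^{g} \prod_{m=1}^{M} \left[ \zeta_{im} \pi_i f(x_j; \theta_i) \right]^{\xi_{jm} \tau^p_{ij}}, \tag{11}$$

whence

$$\begin{aligned}
\ln L_p(\psi) &= \sum_{j=1}^{n_p} \sum_{i=1}^{g} \sum_{m=1}^{M} \ln \left\{ \left[ \zeta_{im} \pi_i f(x_j; \theta_i) \right]^{\xi_{jm} \tau^p_{ij}} \right\} \\
&= \sum_{j=1}^{n_p} \sum_{i=1}^{g} \sum_{m=1}^{M} \xi_{jm} \tau^p_{ij} \ln \left[ \zeta_{im} \pi_i f(x_j; \theta_i) \right],
\end{aligned} \tag{12}$$

where $\xi_{jm}$ is a prior or expected probability of class membership with

$$\sum_{m=1}^{M} \xi_{jm} = 1, \tag{13}$$

and

$$\tau^p_{ij} = \frac{\pi_i f(x_j; \theta_i)}{\sum_{i'=1}^{g} \pi_{i'} f(x_j; \theta_{i'})} \tag{14}$$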




NSWCDD/TR-95/92 


is the expectation (posterior probability) that the $j$th observation came from the $i$th mixture term.
Since this is an expectation, it will be held fixed while taking derivatives. Two common methods
for specifying partial categorization are (1) to give prior probabilities for each class for
observation $j$ or (2) to specify that the priors are zero for some subset of the classes and to use posterior
probabilities across the remaining possible classes for the “unknown” $\xi_{jm}$.
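The sketch below illustrates the two quantities used for a partially categorized observation: the term posterior $\tau^p_{ij}$ of Equation (14), and one possible reading of method (2), in which excluded classes receive zero weight and the remaining weights are the class posteriors renormalized over the allowed classes. The report does not spell out that second computation, so `class_weights_from_exclusions` is an assumption, as are the univariate normal terms, the function names, and the numbers.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def term_posteriors(x, weights, means, variances):
    """Equation (14): posterior probability of each mixture term given x."""
    joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [value / total for value in joint]


def class_weights_from_exclusions(x, allowed, weights, means, variances, zeta):
    """Assumed form of method (2): zero weight outside `allowed`, posterior within."""
    n_classes = len(zeta[0])
    scores = []
    for m in range(n_classes):
        if m in allowed:
            scores.append(sum(
                zeta[i][m] * weights[i] * normal_pdf(x, means[i], variances[i])
                for i in range(len(weights))
            ))
        else:
            scores.append(0.0)
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical model: g = 2 terms, M = 3 classes; class 2 is ruled out for x.
weights = [0.4, 0.6]
means = [0.0, 3.0]
variances = [1.0, 2.0]
zeta = [[0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6]]
tau_p = term_posteriors(1.0, weights, means, variances)
xi = class_weights_from_exclusions(1.0, {0, 1}, weights, means, variances, zeta)
print(tau_p, xi)
```

Both outputs sum to one, in agreement with Equation (13) for the class weights.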

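For uncategorized data,

$$L_u(\psi) = \prod_{j=1}^{n_u} \left[ \sum_{i=1}^{g} \sum_{m=1}^{M} \zeta_{im} \pi_i f(x_j; \theta_i) \right] = \prod_{j=1}^{n_u} \sum_{i=1}^{g} \left[ \pi_i f(x_j; \theta_i) \right], \tag{15}$$

$$\ln L_u(\psi) = \sum_{j=1}^{n_u} \ln \sum_{i=1}^{g} \sum_{m=1}^{M} \zeta_{im} \pi_i f(x_j; \theta_i) = \sum_{j=1}^{n_u} \ln \sum_{i=1}^{g} \left[ \pi_i f(x_j; \theta_i) \right]. \tag{16}$$

For combined categorized/partially categorized/uncategorized data,

$$\ln L(\psi) = \ln L_c(\psi) + \ln L_p(\psi) + \ln L_u(\psi), \tag{17}$$

$$\begin{aligned}
\ln L(\psi) ={}& \sum_{j=1}^{n_c} \ln \sum_{i=1}^{g} \sum_{m=1}^{M} z_{jm} \zeta_{im} \left[ \pi_i f(x_j; \theta_i) \right] \\
&+ \sum_{j=1}^{n_p} \sum_{m=1}^{M} \sum_{i=1}^{g} \xi_{jm} \tau^p_{ij} \ln \left[ \zeta_{im} \pi_i f(x_j; \theta_i) \right] + \sum_{j=1}^{n_u} \ln \sum_{i=1}^{g} \pi_i f(x_j; \theta_i).
\end{aligned} \tag{18}$$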


Historically with mixture models, reference to categorization of data has been with respect
to which term of the mixture model is associated with a given observation. While this is logical
when each term is ascribed a “class” status as in clustering, this work uses a completely different
definition of categorized data. Here the concern is that of categorizing data only with respect to
class membership, rather than with respect to individual mixture model terms.


DERIVATION  OF  MAXIMUM  LIKELIHOOD  E-M  EQUATIONS 

The log-likelihood is to be maximized with respect to variation of the parameter set $\psi$. The
usual procedure is followed: take the derivative of the log-likelihood with respect to each of the
parameters, set the resultant expressions equal to zero, and solve for the conditions that must
be satisfied at a maximum. Since these expressions comprise a coupled set of nonlinear
equations, convergence must be obtained through iteration, which is just the E-M algorithm [1].
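While the resultant equations for $\zeta$ and $\pi$ are independent of the parametric function
$f(\cdot\,; \theta)$ used for the mixture model, the remaining equations are not. Therefore, in the following
derivation, it is assumed that

$$f(x; \theta_i) = f(x; \mu_i, \Sigma_i) = \frac{1}{\sqrt{2\pi \Sigma_i}} \exp\left[ -\frac{(x - \mu_i)^2}{2 \Sigma_i} \right]. \tag{19}$$

That is, the mixture models are mixtures of univariate normals, each of which is parametrized by
mean $\mu$ and variance $\Sigma$. (In the normalizing factor of Equation (19), $\pi$ is the usual numerical
constant, not a mixing coefficient.)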


E-M Equations for ζ

First, derive the equations for the parameter $\zeta_{im}$. Taking the derivatives of each of the three
log-likelihood components results in the following expressions.


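$$\frac{\partial}{\partial \zeta_{im}} \ln L_c(\psi) = \sum_{j=1}^{n_c} \frac{z_{jm} \pi_i f(x_j; \theta_i)}{\sum_{i'=1}^{g} \sum_{m'=1}^{M} z_{jm'} \zeta_{i'm'} \pi_{i'} f(x_j; \theta_{i'})}, \tag{20}$$

$$\frac{\partial}{\partial \zeta_{im}} \ln L_p(\psi) = \sum_{j=1}^{n_p} \frac{\xi_{jm} \tau^p_{ij}}{\zeta_{im}}, \tag{21}$$

$$\frac{\partial}{\partial \zeta_{im}} \ln L_u(\psi) = 0. \tag{22}$$

Combining these expressions and setting equal to zero,

$$\frac{\partial}{\partial \zeta_{im}} \left[ \ln L_c(\psi) + \ln L_p(\psi) + \ln L_u(\psi) \right] = 0, \tag{23}$$

gives

$$\sum_{j=1}^{n_c} \frac{z_{jm} \pi_i f(x_j; \theta_i)}{\sum_{i'=1}^{g} \sum_{m'=1}^{M} z_{jm'} \zeta_{i'm'} \pi_{i'} f(x_j; \theta_{i'})} + \sum_{j=1}^{n_p} \frac{\xi_{jm} \tau^p_{ij}}{\zeta_{im}} = C_i, \tag{24}$$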


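where the constant $C_i$ arises by virtue of the constraint among the $\zeta_{im}$ for fixed $i$. Equation (23) must
hold independently for every $i$. Define $\tau^c_{ijm}$ by

$$\tau^c_{ijm} = \frac{z_{jm} \zeta_{im} \pi_i f(x_j; \theta_i)}{\sum_{i'=1}^{g} \sum_{m'=1}^{M} z_{jm'} \zeta_{i'm'} \pi_{i'} f(x_j; \theta_{i'})} \tag{25}$$

and similarly define $\tau^c_{ij}$ through

$$\tau^c_{ij} = \sum_{m=1}^{M} \tau^c_{ijm} = \frac{\sum_{m=1}^{M} z_{jm} \zeta_{im} \pi_i f(x_j; \theta_i)}{\sum_{i'=1}^{g} \sum_{m'=1}^{M} z_{jm'} \zeta_{i'm'} \pi_{i'} f(x_j; \theta_{i'})}. \tag{26}$$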


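Then Equation (24), multiplied through by $\zeta_{im}$, becomes

$$\sum_{j=1}^{n_c} \tau^c_{ijm} + \sum_{j=1}^{n_p} \xi_{jm} \tau^p_{ij} = C_i \zeta_{im}. \tag{27}$$

By summing both sides over $m$ for fixed $i$, and using the identity Equation (5), the constant is
found to be

$$C_i = \sum_{j=1}^{n_c} \tau^c_{ij} + \sum_{j=1}^{n_p} \tau^p_{ij}. \tag{28}$$

Thus

$$\zeta_{im} = \frac{\sum_{j=1}^{n_c} \tau^c_{ijm} + \sum_{j=1}^{n_p} \xi_{jm} \tau^p_{ij}}{\sum_{j=1}^{n_c} \tau^c_{ij} + \sum_{j=1}^{n_p} \tau^p_{ij}}, \tag{29}$$

which completes the derivation of the E-M equation for the $\zeta_{im}$.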
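A minimal sketch of the update in Equation (29) follows. It assumes univariate normal terms, class labels stored as integer indices for the categorized data, and a list of $M$ class weights $\xi_{jm}$ for each partially categorized observation; the function name, data layout, and numbers are all hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def zeta_update(labeled, partial, weights, means, variances, zeta):
    """Equation (29). labeled: [(x, m)]; partial: [(x, xi)] with xi summing to one."""
    g, n_classes = len(weights), len(zeta[0])
    num = [[0.0] * n_classes for _ in range(g)]
    den = [0.0] * g
    for x, m in labeled:
        # tau^c_ijm is nonzero only for the recorded class m (z_jm = 1).
        joint = [zeta[i][m] * weights[i] * normal_pdf(x, means[i], variances[i])
                 for i in range(g)]
        total = sum(joint)
        for i in range(g):
            tau = joint[i] / total
            num[i][m] += tau
            den[i] += tau
    for x, xi in partial:
        joint = [weights[i] * normal_pdf(x, means[i], variances[i]) for i in range(g)]
        total = sum(joint)
        for i in range(g):
            tau = joint[i] / total  # tau^p_ij
            den[i] += tau
            for m in range(n_classes):
                num[i][m] += xi[m] * tau
    return [[num[i][m] / den[i] for m in range(n_classes)] for i in range(g)]


# Hypothetical data and current parameter values.
labeled = [(-0.2, 0), (0.4, 0), (2.9, 1), (3.3, 1)]
partial = [(0.1, [0.8, 0.2]), (3.0, [0.3, 0.7])]
new_zeta = zeta_update(labeled, partial, [0.5, 0.5], [0.0, 3.0], [1.0, 1.0],
                       [[0.5, 0.5], [0.5, 0.5]])
print(new_zeta)
```

Each row of the returned table sums to one, which is the constraint of Equation (5) that fixed the constant $C_i$. Uncategorized observations do not enter this update, consistent with Equation (22).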
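E-M Equations for π

Next, derive the corresponding equation for $\pi$ by evaluating

$$\frac{\partial}{\partial \pi_i} \left[ \ln L_c(\psi) + \ln L_p(\psi) + \ln L_u(\psi) \right] = 0. \tag{30}$$

Recalling that

$$\begin{aligned}
\ln L(\psi) ={}& \sum_{j=1}^{n_c} \ln \sum_{i=1}^{g} \sum_{m=1}^{M} z_{jm} \zeta_{im} \left[ \pi_i f(x_j; \theta_i) \right] \\
&+ \sum_{j=1}^{n_p} \sum_{m=1}^{M} \sum_{i=1}^{g} \xi_{jm} \tau^p_{ij} \ln \left[ \zeta_{im} \pi_i f(x_j; \theta_i) \right] + \sum_{j=1}^{n_u} \ln \sum_{i=1}^{g} \pi_i f(x_j; \theta_i),
\end{aligned}$$

it is found that

$$\frac{\partial}{\partial \pi_i} \ln L(\psi) = \sum_{j=1}^{n_c} \left[ T_c(j,i) - T_c(j,g) \right] + \sum_{j=1}^{n_p} \left[ T_p(j,i) - T_p(j,g) \right] + \sum_{j=1}^{n_u} \left[ T_u(j,i) - T_u(j,g) \right], \tag{31}$$

where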


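$$T_c(j,i) = \frac{\sum_{m=1}^{M} z_{jm} \zeta_{im} f(x_j; \theta_i)}{\sum_{i'=1}^{g} \sum_{m'=1}^{M} z_{jm'} \zeta_{i'm'} \pi_{i'} f(x_j; \theta_{i'})} = \sum_{m=1}^{M} \frac{\tau^c_{ijm}}{\pi_i} = \frac{\tau^c_{ij}}{\pi_i}, \tag{32}$$

$$T_c(j,g) = \sum_{m=1}^{M} \frac{\tau^c_{gjm}}{\pi_g} = \frac{\tau^c_{gj}}{\pi_g}, \tag{33}$$

$$T_p(j,i) = \sum_{m=1}^{M} \frac{\xi_{jm} \tau^p_{ij}}{\pi_i} = \frac{\tau^p_{ij}}{\pi_i}, \tag{34}$$

and

$$T_u(j,i) = \frac{f(x_j; \theta_i)}{\sum_{i'=1}^{g} \left[ \pi_{i'} f(x_j; \theta_{i'}) \right]} = \frac{\tau^u_{ij}}{\pi_i}. \tag{35}$$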


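In terms of these expressions, the derivative is given by

$$\begin{aligned}
\frac{\partial}{\partial \pi_i} \ln L(\psi) &= \sum_{j=1}^{n_c} \left[ T_c(j,i) - T_c(j,g) \right] + \sum_{j=1}^{n_p} \left[ T_p(j,i) - T_p(j,g) \right] + \sum_{j=1}^{n_u} \left[ T_u(j,i) - T_u(j,g) \right] \\
&= \sum_{j=1}^{n_c} \left[ \frac{\tau^c_{ij}}{\pi_i} - \frac{\tau^c_{gj}}{\pi_g} \right] + \sum_{j=1}^{n_p} \left[ \frac{\tau^p_{ij}}{\pi_i} - \frac{\tau^p_{gj}}{\pi_g} \right] + \sum_{j=1}^{n_u} \left[ \frac{\tau^u_{ij}}{\pi_i} - \frac{\tau^u_{gj}}{\pi_g} \right].
\end{aligned} \tag{36}$$
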

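Setting this equal to zero results in

$$\sum_{j=1}^{n_c} \frac{\tau^c_{ij}}{\pi_i} + \sum_{j=1}^{n_p} \frac{\tau^p_{ij}}{\pi_i} + \sum_{j=1}^{n_u} \frac{\tau^u_{ij}}{\pi_i} = \sum_{j=1}^{n_c} \frac{\tau^c_{gj}}{\pi_g} + \sum_{j=1}^{n_p} \frac{\tau^p_{gj}}{\pi_g} + \sum_{j=1}^{n_u} \frac{\tau^u_{gj}}{\pi_g} = C. \tag{37}$$

Solving for $C$,

$$\pi_i C = \sum_{j=1}^{n_c} \tau^c_{ij} + \sum_{j=1}^{n_p} \tau^p_{ij} + \sum_{j=1}^{n_u} \tau^u_{ij}, \tag{38}$$

whence

$$C = \sum_{i=1}^{g} \pi_i C = \sum_{i=1}^{g} \sum_{j=1}^{n_c} \tau^c_{ij} + \sum_{i=1}^{g} \sum_{j=1}^{n_p} \tau^p_{ij} + \sum_{i=1}^{g} \sum_{j=1}^{n_u} \tau^u_{ij} = n_c + n_p + n_u. \tag{39}$$

The final result is thus

$$\pi_i = \frac{1}{(n_c + n_p + n_u)} \left[ \sum_{j=1}^{n_c} \tau^c_{ij} + \sum_{j=1}^{n_p} \tau^p_{ij} + \sum_{j=1}^{n_u} \tau^u_{ij} \right]. \tag{40}$$

This gives the maximum likelihood equation for $\pi$.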
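Equation (40) says that each mixing coefficient is the average responsibility of its term over all observations, whatever their categorization. A minimal sketch, assuming the responsibilities have already been computed and are stored as one list of $g$ values per observation (the function name and numbers are hypothetical):

```python
def update_mixing_coefficients(tau_c, tau_p, tau_u):
    """Equation (40): pooled responsibilities divided by n_c + n_p + n_u.

    Each argument is a list over observations of lists of g responsibilities.
    """
    pooled = tau_c + tau_p + tau_u
    n_total = len(pooled)
    g = len(pooled[0])
    return [sum(row[i] for row in pooled) / n_total for i in range(g)]


# Hypothetical responsibilities for g = 2 terms; each row sums to one.
tau_c = [[1.0, 0.0]]
tau_p = [[0.5, 0.5]]
tau_u = [[0.2, 0.8], [0.3, 0.7]]
new_pi = update_mixing_coefficients(tau_c, tau_p, tau_u)
print(new_pi)
```

Because every row of responsibilities sums to one, the updated coefficients sum to one as well, which is the content of Equation (39).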
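E-M Equations for μ

Next, derive the corresponding equation for $\mu$. Start with

$$\frac{\partial}{\partial \mu_i} \left[ \ln L_c(\psi) + \ln L_p(\psi) + \ln L_u(\psi) \right] = 0. \tag{41}$$

Recalling,

$$\begin{aligned}
\ln L(\psi) ={}& \sum_{j=1}^{n_c} \ln \sum_{i=1}^{g} \sum_{m=1}^{M} z_{jm} \zeta_{im} \left[ \pi_i f(x_j; \theta_i) \right] \\
&+ \sum_{j=1}^{n_p} \sum_{m=1}^{M} \sum_{i=1}^{g} \xi_{jm} \tau^p_{ij} \ln \left[ \zeta_{im} \pi_i f(x_j; \theta_i) \right] + \sum_{j=1}^{n_u} \ln \sum_{i=1}^{g} \pi_i f(x_j; \theta_i),
\end{aligned}$$

gives

$$\frac{\partial}{\partial \mu_i} \ln L(\psi) = \sum_{j=1}^{n_c} \left[ T_c(j,i) \right] + \sum_{j=1}^{n_p} \left[ T_p(j,i) \right] + \sum_{j=1}^{n_u} \left[ T_u(j,i) \right], \tag{42}$$

where

$$T_c(j,i) = \sum_{m=1}^{M} \tau^c_{ijm} \left[ (x_j - \mu_i) / \Sigma_i \right] = \tau^c_{ij} \left[ (x_j - \mu_i) / \Sigma_i \right] \tag{43}$$

for the univariate normal mixture model case. Similarly,

$$T_p(j,i) = \sum_{m=1}^{M} \left\{ \xi_{jm} \tau^p_{ij} \left[ (x_j - \mu_i) / \Sigma_i \right] \right\} = \tau^p_{ij} \left[ (x_j - \mu_i) / \Sigma_i \right] \tag{44}$$

and

$$T_u(j,i) = \tau^u_{ij} \left[ (x_j - \mu_i) / \Sigma_i \right]. \tag{45}$$

When put together, the resulting expression is

$$\begin{aligned}
\frac{\partial}{\partial \mu_i} \ln L(\psi) &= \sum_{j=1}^{n_c} \left[ T_c(j,i) \right] + \sum_{j=1}^{n_p} \left[ T_p(j,i) \right] + \sum_{j=1}^{n_u} \left[ T_u(j,i) \right] \\
&= \sum_{j=1}^{n_c} \sum_{m=1}^{M} \left[ \tau^c_{ijm} (x_j - \mu_i) / \Sigma_i \right] + \sum_{j=1}^{n_p} \left[ \tau^p_{ij} (x_j - \mu_i) / \Sigma_i \right] + \sum_{j=1}^{n_u} \left[ \tau^u_{ij} (x_j - \mu_i) / \Sigma_i \right] = 0,
\end{aligned} \tag{46}$$

which becomes

$$\mu_i \left\{ \sum_{j=1}^{n_c} \left[ \tau^c_{ij} \right] + \sum_{j=1}^{n_p} \left[ \tau^p_{ij} \right] + \sum_{j=1}^{n_u} \left[ \tau^u_{ij} \right] \right\} = \sum_{j=1}^{n_c} \left[ \tau^c_{ij} x_j \right] + \sum_{j=1}^{n_p} \left[ \tau^p_{ij} x_j \right] + \sum_{j=1}^{n_u} \left[ \tau^u_{ij} x_j \right], \tag{47}$$

whence the final result is obtained

$$\mu_i = \frac{\sum_{j=1}^{n_c} \left[ \tau^c_{ij} x_j \right] + \sum_{j=1}^{n_p} \left[ \tau^p_{ij} x_j \right] + \sum_{j=1}^{n_u} \left[ \tau^u_{ij} x_j \right]}{\sum_{j=1}^{n_c} \left[ \tau^c_{ij} \right] + \sum_{j=1}^{n_p} \left[ \tau^p_{ij} \right] + \sum_{j=1}^{n_u} \left[ \tau^u_{ij} \right]}. \tag{48}$$

This provides the joint representation mixture model expression for $\mu$.
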
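E-M Equations for Σ

Finally, derive the corresponding equation for $\Sigma$. Evaluating

$$\frac{\partial}{\partial \Sigma_i} \left[ \ln L_c(\psi) + \ln L_p(\psi) + \ln L_u(\psi) \right] = 0, \tag{49}$$

with

$$\begin{aligned}
\ln L(\psi) ={}& \sum_{j=1}^{n_c} \ln \sum_{i=1}^{g} \sum_{m=1}^{M} z_{jm} \zeta_{im} \left[ \pi_i f(x_j; \theta_i) \right] \\
&+ \sum_{j=1}^{n_p} \sum_{m=1}^{M} \sum_{i=1}^{g} \xi_{jm} \tau^p_{ij} \ln \left[ \zeta_{im} \pi_i f(x_j; \theta_i) \right] + \sum_{j=1}^{n_u} \ln \sum_{i=1}^{g} \pi_i f(x_j; \theta_i),
\end{aligned}$$

results in

$$\frac{\partial}{\partial \Sigma_i} \ln L(\psi) = \sum_{j=1}^{n_c} \left[ T_c(j,i) \right] + \sum_{j=1}^{n_p} \left[ T_p(j,i) \right] + \sum_{j=1}^{n_u} \left[ T_u(j,i) \right], \tag{50}$$

where

$$T_c(j,i) = \sum_{m=1}^{M} \tau^c_{ijm} \left[ \frac{(x_j - \mu_i)^2}{2 \Sigma_i^2} - \frac{1}{2 \Sigma_i} \right] = \tau^c_{ij} \left[ \frac{(x_j - \mu_i)^2}{2 \Sigma_i^2} - \frac{1}{2 \Sigma_i} \right], \tag{51}$$

where again, attention has been restricted to the univariate normal mixture model case. Similarly,

$$T_p(j,i) = \tau^p_{ij} \left[ \frac{(x_j - \mu_i)^2}{2 \Sigma_i^2} - \frac{1}{2 \Sigma_i} \right], \tag{52}$$

$$T_u(j,i) = \tau^u_{ij} \left[ \frac{(x_j - \mu_i)^2}{2 \Sigma_i^2} - \frac{1}{2 \Sigma_i} \right]. \tag{53}$$

When put together, the resulting expression is

$$\begin{aligned}
\frac{\partial}{\partial \Sigma_i} \ln L(\psi) &= \sum_{j=1}^{n_c} \left[ T_c(j,i) \right] + \sum_{j=1}^{n_p} \left[ T_p(j,i) \right] + \sum_{j=1}^{n_u} \left[ T_u(j,i) \right] \\
&= \sum_{j=1}^{n_c} \tau^c_{ij} \left[ \frac{(x_j - \mu_i)^2}{2 \Sigma_i^2} - \frac{1}{2 \Sigma_i} \right] + \sum_{j=1}^{n_p} \tau^p_{ij} \left[ \frac{(x_j - \mu_i)^2}{2 \Sigma_i^2} - \frac{1}{2 \Sigma_i} \right] + \sum_{j=1}^{n_u} \tau^u_{ij} \left[ \frac{(x_j - \mu_i)^2}{2 \Sigma_i^2} - \frac{1}{2 \Sigma_i} \right] = 0,
\end{aligned} \tag{54}$$

which leads to the final result

$$\Sigma_i = \frac{\sum_{j=1}^{n_c} \tau^c_{ij} (x_j - \mu_i)^2 + \sum_{j=1}^{n_p} \tau^p_{ij} (x_j - \mu_i)^2 + \sum_{j=1}^{n_u} \tau^u_{ij} (x_j - \mu_i)^2}{\sum_{j=1}^{n_c} \tau^c_{ij} + \sum_{j=1}^{n_p} \tau^p_{ij} + \sum_{j=1}^{n_u} \tau^u_{ij}}. \tag{55}$$

This provides the joint representation mixture model expression for $\Sigma$.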
RESULTS


SUMMARY  OF  UNIVARIATE  ITERATIVE  E-M  UPDATE  EQUATIONS 


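$$\zeta_{im} = \frac{\sum_{j=1}^{n_c} \tau^c_{ijm} + \sum_{j=1}^{n_p} \xi_{jm} \tau^p_{ij}}{\sum_{j=1}^{n_c} \tau^c_{ij} + \sum_{j=1}^{n_p} \tau^p_{ij}}, \tag{56}$$

$$\pi_i = \frac{1}{(n_c + n_p + n_u)} \left[ \sum_{j=1}^{n_c} \tau^c_{ij} + \sum_{j=1}^{n_p} \tau^p_{ij} + \sum_{j=1}^{n_u} \tau^u_{ij} \right], \tag{57}$$

$$\mu_i = \frac{\sum_{j=1}^{n_c} \left[ \tau^c_{ij} x_j \right] + \sum_{j=1}^{n_p} \left[ \tau^p_{ij} x_j \right] + \sum_{j=1}^{n_u} \left[ \tau^u_{ij} x_j \right]}{\sum_{j=1}^{n_c} \left[ \tau^c_{ij} \right] + \sum_{j=1}^{n_p} \left[ \tau^p_{ij} \right] + \sum_{j=1}^{n_u} \left[ \tau^u_{ij} \right]}, \tag{58}$$

$$\Sigma_i = \frac{\sum_{j=1}^{n_c} \tau^c_{ij} (x_j - \mu_i)^2 + \sum_{j=1}^{n_p} \tau^p_{ij} (x_j - \mu_i)^2 + \sum_{j=1}^{n_u} \tau^u_{ij} (x_j - \mu_i)^2}{\sum_{j=1}^{n_c} \tau^c_{ij} + \sum_{j=1}^{n_p} \tau^p_{ij} + \sum_{j=1}^{n_u} \tau^u_{ij}}, \tag{59}$$

where

$$\tau^c_{ijm} = \frac{z_{jm} \zeta_{im} \pi_i f(x_j; \theta_i)}{\sum_{i'=1}^{g} \sum_{m'=1}^{M} z_{jm'} \zeta_{i'm'} \pi_{i'} f(x_j; \theta_{i'})} \tag{60}$$

and

$$\tau^c_{ij} = \sum_{m=1}^{M} \tau^c_{ijm} = \frac{\sum_{m=1}^{M} z_{jm} \zeta_{im} \pi_i f(x_j; \theta_i)}{\sum_{i'=1}^{g} \sum_{m'=1}^{M} z_{jm'} \zeta_{i'm'} \pi_{i'} f(x_j; \theta_{i'})}. \tag{61}$$

Similarly

$$\tau^p_{ij} = \tau^u_{ij} = \frac{\pi_i f(x_j; \theta_i)}{\sum_{i'=1}^{g} \pi_{i'} f(x_j; \theta_{i'})}, \tag{62}$$

where it is to be remembered that $\tau^p_{ij}$ is only computed for those $x_j$ that are partially categorized
observations and similarly for $\tau^u_{ij}$ and the uncategorized observations.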

The E-M algorithm then consists of iterating the Expectation step, which consists of evaluating
Equations (60), (61), and (62) for the appropriate observations, and the Maximization step, which
consists of evaluating new parameter values using Equations (56) through (59).
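One way the iteration just described could be organized is sketched below for the univariate normal case. This is an illustrative sketch under stated assumptions rather than the report's implementation: class labels are integer indices, each partially categorized observation carries its own list of class weights $\xi_{jm}$, the variance update uses the newly updated mean (the equations leave that choice open), and the data, starting values, and the name `em_step` are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_step(labeled, partial, unlabeled, pi, mu, var, zeta):
    """One E-M iteration. labeled: [(x, m)]; partial: [(x, xi)]; unlabeled: [x]."""
    g, n_classes = len(pi), len(zeta[0])

    def term_densities(x):
        return [pi[i] * normal_pdf(x, mu[i], var[i]) for i in range(g)]

    resp = []                                      # (x, [tau_ij]) for every observation
    z_num = [[0.0] * n_classes for _ in range(g)]  # numerator of Equation (56)
    z_den = [0.0] * g                              # denominator of Equation (56)

    # Expectation step: Equations (60) through (62).
    for x, m in labeled:
        joint = [zeta[i][m] * d for i, d in enumerate(term_densities(x))]
        total = sum(joint)
        tau = [v / total for v in joint]           # tau^c_ij; only class m contributes
        resp.append((x, tau))
        for i in range(g):
            z_num[i][m] += tau[i]
            z_den[i] += tau[i]
    for x, xi in partial:
        dens = term_densities(x)
        total = sum(dens)
        tau = [v / total for v in dens]            # tau^p_ij
        resp.append((x, tau))
        for i in range(g):
            z_den[i] += tau[i]
            for m in range(n_classes):
                z_num[i][m] += xi[m] * tau[i]
    for x in unlabeled:
        dens = term_densities(x)
        total = sum(dens)
        resp.append((x, [v / total for v in dens]))  # tau^u_ij

    # Maximization step: Equations (56) through (59).
    n_total = len(resp)
    new_pi, new_mu, new_var = [], [], []
    for i in range(g):
        weight = sum(tau[i] for _, tau in resp)
        mean = sum(tau[i] * x for x, tau in resp) / weight
        new_pi.append(weight / n_total)
        new_mu.append(mean)
        new_var.append(sum(tau[i] * (x - mean) ** 2 for x, tau in resp) / weight)
    new_zeta = [[z_num[i][m] / z_den[i] for m in range(n_classes)] for i in range(g)]
    return new_pi, new_mu, new_var, new_zeta


# Hypothetical data: two well-separated groups, two classes.
labeled = [(-2.1, 0), (-1.9, 0), (2.0, 1), (2.2, 1)]
partial = [(-2.0, [0.8, 0.2]), (1.8, [0.1, 0.9])]
unlabeled = [-2.3, -1.7, 1.9, 2.1, 2.4]
pi, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
zeta = [[0.5, 0.5], [0.5, 0.5]]
for _ in range(30):
    pi, mu, var, zeta = em_step(labeled, partial, unlabeled, pi, mu, var, zeta)
print(pi, mu, var, zeta)
```

The sketch needs at least one categorized or partially categorized observation so that the denominator of Equation (56) is nonzero, and it makes no attempt to guard against collapsing variances or to test convergence; a fixed iteration count is used purely for brevity.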

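The multivariate versions can be obtained by making $x_j$ and $\mu_i$ vector quantities and $\Sigma_i$ a
matrix. The equations for $\mu_i$ and $\Sigma_i$ become

$$\mu_i^k = \frac{\sum_{j=1}^{n_c} \left[ \tau^c_{ij} x_j^k \right] + \sum_{j=1}^{n_p} \left[ \tau^p_{ij} x_j^k \right] + \sum_{j=1}^{n_u} \left[ \tau^u_{ij} x_j^k \right]}{\sum_{j=1}^{n_c} \left[ \tau^c_{ij} \right] + \sum_{j=1}^{n_p} \left[ \tau^p_{ij} \right] + \sum_{j=1}^{n_u} \left[ \tau^u_{ij} \right]} \tag{63}$$

and

$$\Sigma_i^{kl} = \frac{\sum_{j=1}^{n_c} \tau^c_{ij} (x_j^k - \mu_i^k)(x_j^l - \mu_i^l) + \sum_{j=1}^{n_p} \tau^p_{ij} (x_j^k - \mu_i^k)(x_j^l - \mu_i^l) + \sum_{j=1}^{n_u} \tau^u_{ij} (x_j^k - \mu_i^k)(x_j^l - \mu_i^l)}{\sum_{j=1}^{n_c} \tau^c_{ij} + \sum_{j=1}^{n_p} \tau^p_{ij} + \sum_{j=1}^{n_u} \tau^u_{ij}}, \tag{64}$$

where the component indices $k$ and $l$ are denoted by superscripts.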
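The component-wise form of Equations (63) and (64) amounts to a responsibility-weighted mean vector and covariance matrix for each term. A minimal sketch for a single term follows, assuming the responsibilities of all observations (categorized, partially categorized, and uncategorized) have been pooled into one list and that the covariance uses the updated mean; the function name and data are hypothetical.

```python
def update_mean_and_covariance(xs, taus):
    """Weighted mean vector and covariance matrix for one mixture term.

    xs:   list of observation vectors (lists of equal length d)
    taus: pooled responsibilities of this term, one per observation
    """
    d = len(xs[0])
    total = sum(taus)
    mean = [sum(t * x[k] for t, x in zip(taus, xs)) / total for k in range(d)]
    cov = [
        [
            sum(t * (x[k] - mean[k]) * (x[l] - mean[l]) for t, x in zip(taus, xs)) / total
            for l in range(d)
        ]
        for k in range(d)
    ]
    return mean, cov


# Hypothetical two-dimensional observations with equal responsibilities.
mean, cov = update_mean_and_covariance([[0.0, 0.0], [2.0, 2.0]], [1.0, 1.0])
print(mean, cov)
```

Setting the dimension to one recovers the univariate updates of Equations (58) and (59).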
Finally, once the joint representation mixture model has been obtained based on any
combination of class categorized, partial class categorized, and uncategorized data, if desired the
probability density function for an individual class can be obtained through

$$p(x \mid \text{Class } m, \psi) = \frac{\sum_{i=1}^{g} \zeta_{im} \pi_i f(x; \theta_i)}{\sum_{i=1}^{g} \zeta_{im} \pi_i}. \tag{65}$$

This gives a properly normalized mixture model density estimate for an individual class.
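The recovery of a class density in Equation (65) can be sketched as follows, again assuming univariate normal terms. The helper `class_prior`, which returns the denominator $\sum_i \zeta_{im} \pi_i$, is a name introduced here for illustration, as are the numerical values.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def class_prior(m, weights, zeta):
    """Denominator of Equation (65): total mixture mass assigned to class m."""
    return sum(zeta[i][m] * weights[i] for i in range(len(weights)))


def class_density(x, m, weights, means, variances, zeta):
    """Equation (65): normalized density of class m."""
    numerator = sum(
        zeta[i][m] * weights[i] * normal_pdf(x, means[i], variances[i])
        for i in range(len(weights))
    )
    return numerator / class_prior(m, weights, zeta)


# Hypothetical fitted model with g = 2 terms and M = 2 classes.
weights = [0.4, 0.6]
means = [0.0, 3.0]
variances = [1.0, 2.0]
zeta = [[0.9, 0.1],
        [0.2, 0.8]]
x = 1.0
recombined = sum(
    class_prior(m, weights, zeta) * class_density(x, m, weights, means, variances, zeta)
    for m in range(2)
)
print(class_density(x, 0, weights, means, variances, zeta), recombined)
```

Weighting the class densities by their class masses and summing recovers the full joint representation density of Equation (4), which is a convenient consistency check on a fitted model.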


CONCLUSIONS 

The derivation of the joint representation mixture model E-M equations has been
presented for mixtures of normal components. While the detailed derivation was for the univariate
case, a slightly more complicated derivation results in the full multivariate equations. The results
for this case have been presented without derivation. It is also important to note that the equations
derived for $\zeta$ and $\pi$ are not restricted to the use of normal functions in the mixture model. These
equations are entirely general for the class of joint representation mixture models considered.

The joint representation approach represents a significant philosophical departure from
current mixture model usage. The standard mixture model usage is either to build a separate
mixture model for each class when the observations are class labeled or to assume that each class is
normally distributed so that a mixture model for all the data can be interpreted as a mixture of
normal classes. The joint representation approach, in effect, totally relaxes the requirement for each
class to be normally distributed. Philosophically, a semiparametric viewpoint has been taken in that it is
assumed that each class can be modelled by a (potentially complex) mixture model and that no
significance is to be ascribed to an individual term in the mixture. As an example, contrast a mixture model
approximation to a lognormal density with a mixture of two normals. In the latter case, it may well
make sense to care about which of the two terms gave rise to a particular observation. That is,
each term may correspond to some recognizable class. However, in the lognormal case, where by
assumption the density is nonparametric with respect to representation by normal mixtures (even
though it has been modelled with a parametric approximation), this sort of distinction has little
or no meaning. In other words, none of the terms used to model the lognormal density can be
given any independent or class meaning.

This approach is thus appropriate for combined supervised/unsupervised (various levels of
class categorization) learning when the individual class densities may be more complex than
simple normals. It provides a unified framework for handling this problem. Once the joint
representation density has been estimated, densities corresponding to the individual classes can be easily
recovered.

Future reports will detail the derivation of recursive versions of these equations as well as
a method for determining the complexity of the joint representation mixture model in a
data-driven manner.